Identification of pharmaceutical targets

ABSTRACT

In order to identify pharmaceutical targets, at least one correlation between the expression rates of different genes of a cell is ascertained by evaluating a plurality of gene expression patterns. In this case, correlations of second or higher order are considered. The correlations make it possible to infer causal relationships between different genes and the associated proteins. The regulatory network of the cell being studied can therefore be deduced from the correlations. Suitable targets can be identified from the regulatory network which has been deduced in such a way.

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application is based on and hereby claims priority to German Application No. 101 59 262.0 filed on Dec. 3, 2001, the contents of which are hereby incorporated by reference.

BACKGROUND OF THE INVENTION

[0002] The human genome comprises approximately 20,000 to 80,000 genes, which contain the genetic code for about one million proteins. In the specialized cells of the body, only subsets of the total number of genes are actually read (expressed) in each case. Taken together, the proteins produced in this way are referred to as the proteome of the cell. The mutual interaction of the proteins, as well as their interaction with the DNA, represents the most important part of the mechanism governing the development of the human body from the fertilized ovum, as well as all the bodily functions. In terms of information technology, the genome therefore represents a procedural code for the structure and function of the human body.

[0003] Many diseases and dysfunctions of the body are due to problems with the functional network made up of the genome and the proteome. Therefore, some medications act as agonists or antagonists for specific target proteins, that is to say they increase or decrease the function of a protein, with the aim of returning the network formed by the proteome and genome to a normal mode of function. These target proteins have to date been derived according to heuristic principles from biochemical considerations. It is in this case often unclear whether the dysfunction of a protein actually represents the cause of the disease, or whether it only represents one of the symptoms of a concealed misregulation at another point of the network.

[0004] For the development of improved therapies, therefore, a quantitative understanding of the interaction between the genome and the proteome is necessary.

SUMMARY OF THE INVENTION

[0005] It is one possible object of the invention to improve the identification of proteins that are suitable as a target for medicinal treatment of genetically related diseases or problems.

[0006] In order to identify pharmaceutical targets, at least one dependency or statistical correlation between the expression rates of different genes of a cell is ascertained by evaluating a multiplicity of gene expression patterns. In this case, inter alia, correlations of second or higher order are considered. The dependencies make it possible to infer causal relationships between different genes and the associated proteins. The regulatory network of the cell being studied can therefore be deduced from the dependencies.

[0007] In this way, it is possible to identify genes which most probably initiate regulatory cascades, or which are responsible for complex changes in the expression patterns, for example in the event of a genetically related disease.

[0008] The method therefore makes it possible to identify targets on a systematic basis. This is done by statistical modeling of the regulatory genetic network using a structure-learning causal network on the basis of gene expression patterns.

[0009] The described method does not rely on information as a function of time, and it can therefore be applied to a wide basis of gene expression measurements.

[0010] The described method is usually carried out with the aid of a computer.

[0011] The method and system are particularly suitable for supplementing high throughput drug discovery methods in biotechnology. Another application relates to the field of assisting tumor diagnosis and tumor treatment. It is possible to study regulatory relationships both in the human body and in any other living being, whether animal or vegetable, bacterium or another cell.

[0012] The individual measurements of the gene expression patterns are in this case regarded as mutually independent. They represent random values which are produced by an unknown high-dimensional probability distribution. Complete characterization of the statistical structure, that is to say of the correlations of the gene expression rates, with the aid of the measured gene expression patterns is equivalent to estimating the joint high-dimensional probability distribution for these patterns. If a measurement involves determining the expression of 5,000 genes, then a 5,000-dimensional probability density needs to be estimated, which generally entails great difficulties.

[0013] Causal networks assume that conditional independencies exist in the data. There is a conditional independency whenever two random variables are mutually independent under the condition that all the other random variables are kept constant, that is to say higher-order correlations via a multistage feedback loop between the two random variables are neglected. The full probability density can then be replaced by a product of lower-dimensional probability densities.

[0014] A particularly efficient way of deducing the correlations or dependencies between the individual random variables, that is to say the expression rates, of the high-dimensional probability distribution involves firstly assuming a set of independent random variables. Successively, the correlation which most reduces the error of the network for the explanation of new data (generalization error) is added to the network in each case. This means that those correlations for which the actually measured gene expression patterns have the highest probability under all conceivable probability distributions are assumed. This is continued until the generalization error can be further reduced only within a predetermined threshold.

[0015] One preferred, simple embodiment of the search strategies for the correlations is carried out with the aid of the following steps:

[0016] firstly, the single edge which minimizes the generalization error is looked for, that is to say the best first edge,

[0017] the best second edge is subsequently looked for,

[0018] etc., until the generalization error can no longer be improved significantly.

[0019] In this way, it is possible to deduce both the correlations between the random variables (expression rates) and also the shape of the high-dimensional probability distribution, at least qualitatively in the latter case. The deduction of the correlations between the random variables, with the possibility of representing these correlations with the aid of at least partially directed graphs, is referred to as structure learning, since the structure of the regulatory network is learnt during this.

[0020] When successively adding correlations, it is possible to employ existing knowledge about regulatory relationships. In this way, the deduction of the regulatory relationships can be made faster and more accurate.

[0021] This algorithm, which is very time-consuming, especially for high-dimensional data, can be accelerated decisively by fast, quasi-optimal search strategies for important dependencies. One known algorithm for this is the greedy algorithm (T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein: “Introduction to Algorithms”, 2nd edition, McGraw-Hill, Columbus, Ohio (2001)).

[0022] By artificial modification of individual gene expression rates, the most probable resulting gene expression pattern can be predicted from the structure of the regulatory network, that is to say of the high-dimensional probability distribution, calculated from the previously available data. This can be compared with measurements of diseased tissue (for example tumor tissue). In this way, it is possible to infer the gene group originally lying at the cause of a pathologically modified cellular function, or possibly the single gene lying at the cause, and to identify the associated protein as the target of a medicinal treatment.

BRIEF DESCRIPTION OF THE DRAWINGS

[0023] These and other objects and advantages of the present invention will become more apparent and more readily appreciated from the following description of the preferred embodiments, taken in conjunction with the accompanying drawings of which:

[0024] FIG. 1 schematically shows the regulatory processes which determine the expression pattern of a cell;

[0025] FIG. 2 shows a directed acyclic graph; and

[0026] FIG. 3 illustrates ways of determining the direction of edges in a directed acyclic graph.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

[0027] Reference will now be made in detail to the preferred embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout.

[0028] FIG. 1 shows the most important interactions between genes and proteins of a DNA segment. The interactions are used as the basis for describing the genomic regulatory network.

[0029] The upper part of FIG. 1 schematically indicates how an external signal acting on the cell from outside (for instance in the scope of intercellular communication), which is picked up for example by a transmembrane receptor protein (for example by a calcium channel) and is transmitted into the interior of the cell in a suitable way, triggers the expression of the genes A, B, C and D of the DNA segment.

[0030] It is therefore in principle also possible to influence the expression rate of individual genes of a cell from outside the cell by the method.

[0031] The term “gene” denotes a not necessarily continuous segment of the DNA which contains the genetic code for a protein, or alternatively for a group of proteins.

[0032] The process for production of a protein from a gene, for example protein A on the basis of gene A in FIG. 1, is referred to as expression of this gene. The conversion of the DNA code of the gene into the chain of amino acids of the protein is referred to as translation. The rate at which protein A is produced in a given context is known as its expression rate.

[0033] Not all the genes are expressed in a cell. Rather, various cell types differ in terms of their gene expression pattern. This is often also true of the difference between diseased and healthy cells.

[0034] The expression pattern of a cell is determined by the regulatory processes schematically represented in FIG. 1. The regulatory processes are essentially determined by a few important interactions between proteins and genes, as well as between the proteins themselves.

[0035] For instance, the expression rate of a gene A may be regulated, that is to say increased, decreased or brought to a stop, by the presence of another protein B. In this example, protein B has a regulatory effect on gene A, or protein A. Regulatory proteins may, for example, be constituted by the protein units of activator complexes. Regulatory proteins may also act simultaneously on many target genes.

[0036] A second type of interaction involves the post-translational modification of proteins, that is to say the modification of proteins after translation. As a rule, post-translational modification of a protein takes place immediately after the end of translation, that is to say before the protein becomes active in the cell. For example, many proteins are phosphorylated or glycosylated by special enzymes, that is to say the target protein is brought into its functional state, or put into a state in which it is no longer active, by adding or removing chemical groups. Post-translational modification may also functionally switch a protein on or off, possibly temporarily.

[0037] In FIG. 1, protein A is a so-called effector protein, that is to say it acts within the cell on other substances, and not directly on the genome or proteome. In FIG. 1, protein C hence modifies the function of the effector protein A through post-translational modification.

[0038] Protein B is a regulatory protein, since it determines the expression rate of protein A, by interacting with that DNA segment which contains gene A. Protein D hence modifies the function of a regulatory protein (protein B) through post-translational modification.

[0039] The nucleic acid sequence of human DNA is substantially known. The genes coded by the DNA are also being identified to an increasing extent. Knowledge about the proteome, including proteins possibly modified post-translationally by interaction between the proteins, is not so complete. Nevertheless, recent sequencing and high throughput screening methods are making rapid identification of further genes and proteins possible.

[0040] Another important step in the clarification of the expression patterns of a cell has come about with the development of high throughput hybridization techniques. In these methods, the expression rates of many hundreds of different genes are tested simultaneously on a so-called microarray. With the aid of these methods, it is possible to determine the gene expression pattern of a cell.

[0041] To that end, as a rule, the mRNA (messenger RNA) synthesized in the cell is determined. mRNA is an intermediate product during translation of the gene into the protein. mRNA is hence a precursor during formation of the protein. The cell to be studied is firstly isolated. It is subsequently broken up. By suitable purification steps, the mRNA from the cell is isolated. The mRNA is then transcribed by reverse transcriptase into cDNA (complementary DNA). The latter is amplified, as a rule by using linear PCR (polymerase chain reaction). The cDNA obtained in this way is qualitatively or quantitatively analyzed with the aid of suitable microarrays, for example DNA chips. With modern microarrays, the expression rates of 5,000 or more genes can be analyzed simultaneously.

[0042] On the basis of these improved techniques, extensive knowledge has by now become available about the human genome and proteome, as well as about the interactions between proteins and genes, and of proteins with one another.

[0043] Some mathematical terms needed for clarification of the regulatory network will firstly be introduced below.

[0044] The expression rates of the individual genes, which are determined from the measured gene expression patterns, are the random variables to be considered below. For gene i, the random variable representing the expression rate is denoted by X_(i). Values which it can take are denoted by x_(i). The random vector, which consists of the expression rates of all k genes, is denoted by $X := \begin{pmatrix} X_{1} \\ \vdots \\ X_{k} \end{pmatrix} = \left( X_{1}, \ldots, X_{k} \right)^{T}.$

[0045] (·)^(T) denotes transposition.

[0046] In order to ascertain the correlations between the expression rates, or the random variables, various moments of the random variables are considered.

[0047] The first moment of the random vector X, which is also referred to as the expectation value E, is defined by

EX := (α₁, . . . , α_(k))^(T) := (EX₁, . . . , EX_(k))^(T).

[0048] On the basis of known statistical considerations, the expectation value EX_(i) of the expression rates X_(i) is estimated with the aid of the arithmetic mean of the observed expression rates x_(i) over n measurements of gene expression patterns: $E^{(s)}X_{i} = \frac{1}{n}\sum\limits_{m = 1}^{n} x_{im},$

[0049] where x_(im) gives the expression rate determined for gene i in measurement m, and the superscript index (s) shows that an estimated value is involved.

[0050] The second moments are defined by

α_(ij) := E(X_(i)·X_(j)).

[0051] Again, on the basis of known statistical considerations, the expectation value E(X_(i)·X_(j)) to be calculated for the second moment is estimated with the aid of the following equation: $E^{(s)}\left( X_{i} \cdot X_{j} \right) = \frac{1}{n}\sum\limits_{m = 1}^{n} x_{im} \cdot x_{jm}.$

[0052] The second central moment is also referred to as the covariance. It is defined by

cov(X_(i), X_(j)) := μ_(ij) := E([X_(i)−EX_(i)]·[X_(j)−EX_(j)]).

[0053] Owing to the linearity of the expectation value, the following applies:

cov(X_(i), X_(j)) = μ_(ij) = E(X_(i)·X_(j)) − EX_(i)·EX_(j) = α_(ij) − α_(i)·α_(j).

[0054] The covariance is estimated in a known way by $\operatorname{cov}^{(s)}\left( X_{i}, X_{j} \right) = \frac{1}{n - 1}\sum\limits_{m = 1}^{n} \left( x_{im} - E^{(s)}X_{i} \right) \cdot \left( x_{jm} - E^{(s)}X_{j} \right).$

[0055] The μ_(ii) are precisely the variances of the individual expression rates X_(i):

σ_(i)² := μ_(ii).

[0056] They are estimated in a known way using $\sigma_{i}^{(s)\,2} = \mu_{ii}^{(s)} = \frac{1}{n - 1}\sum\limits_{m = 1}^{n} \left( x_{im} - E^{(s)}X_{i} \right)^{2}.$

[0057] The k×k matrix

cov(X, X):=E([X−EX]·[X−EX]^(T))=E(X·X^(T))−EX·EX^(T)

[0058] is referred to as the covariance matrix of X.

[0059] The correlation of the random variables X_(i) and X_(j) is often determined with the aid of the (second-order) correlation coefficient. This is defined by $\rho_{ij} := \frac{\operatorname{cov}\left( X_{i}, X_{j} \right)}{\sigma_{i} \cdot \sigma_{j}}.$

[0060] It lies between −1 and +1. It can likewise be estimated by using the indicated estimates for the covariance and the variance. A vanishing correlation coefficient points to the absence of regulatory relationships. A correlation coefficient differing significantly from zero points to a statistical and therefore regulatory dependency.
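The estimators for the expectation value, the covariance and the second-order correlation coefficient can be sketched as follows. This is only an illustration: NumPy, the function name `estimate_moments`, the array layout (one row per measurement m, one column per gene i) and the toy values are assumptions, not part of the described method.

```python
import numpy as np

def estimate_moments(data):
    """Estimate E^(s)X_i, cov^(s)(X_i, X_j) and rho_ij from n measured patterns.

    data: array of shape (n, k), one row per measurement, one column per gene
    (hypothetical layout chosen for this sketch).
    """
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    mean = data.sum(axis=0) / n                # arithmetic mean, E^(s) X_i
    centered = data - mean
    cov = centered.T @ centered / (n - 1)      # covariance estimate with 1/(n-1)
    sigma = np.sqrt(np.diag(cov))              # standard deviations sigma_i
    rho = cov / np.outer(sigma, sigma)         # rho_ij (nan if a gene never varies)
    return mean, cov, rho

# Invented toy data: gene 2 is always twice gene 1, gene 3 falls as gene 1 rises.
patterns = np.array([[1.0, 2.0, 5.0],
                     [2.0, 4.0, 4.0],
                     [3.0, 6.0, 3.0],
                     [4.0, 8.0, 2.0],
                     [5.0, 10.0, 1.0]])
mean, cov, rho = estimate_moments(patterns)
```

In this toy data set the coefficient between genes 1 and 2 is +1 (perfect correlation) and the coefficient between genes 1 and 3 is −1 (perfect anti-correlation).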

[0061] The above definitions can be generalized to third, fourth and any higher moments. In particular, the third moment is defined by

α_(ijk) := E(X_(i)·X_(j)·X_(k)).

[0062] The third central moment is defined by

μ_(ijk) := E([X_(i)−EX_(i)]·[X_(j)−EX_(j)]·[X_(k)−EX_(k)]).

[0063] It is estimated in a known way by $\mu_{ijk}^{(s)} = \frac{1}{n - 2}\sum\limits_{m = 1}^{n} \left( x_{im} - E^{(s)}X_{i} \right) \cdot \left( x_{jm} - E^{(s)}X_{j} \right) \cdot \left( x_{km} - E^{(s)}X_{k} \right).$

[0064] The correlation of the random variables X_(i), X_(j) and X_(k) can likewise be determined with the aid of the third-order correlation coefficient. This is defined by $\rho_{ijk} := \frac{\mu_{ijk}}{\sigma_{i} \cdot \sigma_{j} \cdot \sigma_{k}}.$

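[0065] Unlike the second-order coefficient, it is not in general confined to the interval between −1 and +1 (for i = j = k it reduces to the skewness, which is unbounded). It can be estimated in the same way as the second-order correlation coefficient.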
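A minimal sketch of the third-order estimate, assuming NumPy and the same hypothetical array layout (rows are measurements, columns are genes). The function name and the toy data are invented for illustration; the data are constructed so that every pairwise correlation vanishes while the three genes are still jointly dependent.

```python
import numpy as np

def third_order_coefficient(data, i, j, k):
    """Estimate rho_ijk = mu_ijk / (sigma_i * sigma_j * sigma_k).

    Uses the 1/(n-2) normalization for the third central moment and the
    1/(n-1) normalization for the variances, as in the estimators above.
    """
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    centered = data - data.mean(axis=0)
    mu_ijk = (centered[:, i] * centered[:, j] * centered[:, k]).sum() / (n - 2)
    sigma = centered.std(axis=0, ddof=1)       # ddof=1 gives the 1/(n-1) estimate
    return mu_ijk / (sigma[i] * sigma[j] * sigma[k])

# Invented toy data: gene 3 is "on" exactly when genes 1 and 2 agree.
data = np.array([[0, 0, 1],
                 [0, 1, 0],
                 [1, 0, 0],
                 [1, 1, 1]], dtype=float)

pairwise = np.corrcoef(data, rowvar=False)     # all off-diagonal entries vanish
triple = third_order_coefficient(data, 0, 1, 2)
```

Here the second-order analysis alone would report no dependency, whereas `triple` is clearly different from zero.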

[0066] In an exemplary embodiment, the presence of regulatory dependencies is ascertained by testing the correlation coefficients in respect of whether they differ significantly from zero. Statistically speaking, the hypothesis that the correlation coefficient vanishes is tested. This can be done with the aid of various known statistical test methods. One method is, for example, described in Bronstein-Semendjajew: “Taschenbuch der Mathematik” (handbook of mathematics), Verlag Harri Deutsch, 22nd edition, 1985, p. 693.
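One possible test of the hypothesis that a second-order correlation coefficient vanishes is sketched below. It uses Fisher's z-transform with a normal approximation, which is an assumption made for this illustration and not necessarily the test described in the cited handbook; the function names and the significance level of 0.05 are likewise hypothetical.

```python
import math

def correlation_p_value(r, n):
    """Two-sided p-value for the hypothesis rho = 0, given an estimate r from n measurements.

    Fisher's z-transform: atanh(r) * sqrt(n - 3) is approximately standard normal
    when the true correlation is zero.
    """
    if n <= 3:
        raise ValueError("need more than 3 measurements")
    if abs(r) >= 1.0:
        return 0.0
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2.0))   # P(|Z| > |z|) for standard normal Z

def is_significant(r, n, alpha=0.05):
    """True if the coefficient differs significantly from zero at level alpha."""
    return correlation_p_value(r, n) < alpha
```

With these assumptions, an estimate of r = 0.5 from 50 measurements would be treated as a dependency, whereas r = 0.1 from 20 measurements would not.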

[0067] The described methods generally have the purpose of clarifying statistical dependencies or independencies, and thereby extracting the network of influences from the data.

[0068] If protein B regulates gene A and there are no other regulatory phenomena, then this property is expressed in a statistical correlation or anti-correlation of the two expression rates over various measurements (second-order statistical dependency or correlation).

[0069] The presence of a metaregulator such as protein D in FIG. 1, however, is expressed in a third-order statistical dependency, that is to say in a non-vanishing third-order correlation coefficient.

[0070] In a cell, there are many partially still unknown regulatory feedback loops, the existence of which is expressed in complex statistical relationships between expression rates.

[0071] Correlations are often represented by directed graphs between random variables (see, for example, David Edwards: “Introduction to Graphical Modeling”, Springer Texts in Statistics, Springer Verlag, 1995). Such models are therefore also referred to as graphical models.

[0072] The high-dimensional probability distribution for the random variables $X = \begin{pmatrix} X_{1} \\ \vdots \\ X_{k} \end{pmatrix} = \left( X_{1}, \ldots, X_{k} \right)^{T}$

[0073] can be represented with the aid of a network or graph G, as shown in FIG. 2 for a simple example. The nodes 1, 2 and 3 correspond in this case to random variables X₁, X₂ and X₃. In the scope of the statistical modeling of regulatory relationships in the genome, the random variables are identified with the expression rates.

[0074] In graph G according to FIG. 2, dependencies are represented by directed edges. In this case, the dependency of random variable X₂ on random variable X₁ is represented by a directed edge 12 from node 1 to node 2. The dependency of random variable X₃ on random variable X₂ is represented by a directed edge 14 from node 2 to node 3.

[0075] If a second-order correlation is established, then this is shown in the graph by an edge between two nodes, that is to say between two random variables. In general, it is not possible to ascertain the direction of this edge, that is to say which of the two random variables is the cause of the other. Only the simultaneous occurrence is observed. Therefore, it is also not in general possible to ascertain which of the two involved genes or proteins regulates the other.

[0076] In certain cases, however, the direction of an edge can be ascertained. FIG. 3A shows such a case. Three nodes 1, 2 and 3 are shown. Two edges are indicated between these three nodes, specifically the edge 20 between nodes 1 and 3 and the edge 22 between nodes 2 and 3. Both edges are directed toward node 3. In graph theory, such a case is generally referred to as a “collider”. Statistically, in such a constellation, a second-order correlation will be ascertained between nodes 1 and 3, that is to say the associated random variables, as well as a further second-order correlation between nodes 2 and 3. No third-order correlations, however, will be established since, for example, random variables 1 and 3 influence each other but without having an influence on random variable 2.

[0077] Put in terms of the regulatory interactions between genes or proteins, the graph according to FIG. 3A shows that gene 3 is regulated by genes or proteins 1 and 2, but not vice versa. If gene 1 is expressed, for example, then based on the model according to FIG. 3A gene 3 will also be expressed. This does not, however, imply that gene 2 will also be expressed. If two second-order correlations are found, one between node 1 and node 3 and the other between node 2 and node 3, then the edges cannot be directed differently since otherwise a third-order correlation would be shown (cf. FIG. 3B).

[0078] The situation is different in the case of FIG. 3B. FIG. 3B shows graphs which essentially correspond to the graph according to FIG. 3A, and which are to be read in a similar way. Only the edges and their directions are varied. All the graphs shown in FIG. 3B indicate exclusively a third-order correlation between nodes 1, 2 and 3, and they cannot be discriminated on the basis of correlation analysis.

[0079] In general, it is very difficult to deduce post-translational modifications on the basis of gene expression patterns. However, third-order correlations give at least an indication of such post-translational modifications.

[0080] The identification of the graph associated with a regulatory network will be explained in more detail below.

[0081] The joint probability distribution of the random variables X₁, X₂ and X₃ in FIG. 2 can always be expressed by a product of conditional probabilities:

P(X₁, X₂, X₃) = P(X₃|X₂, X₁)·P(X₂|X₁)·P(X₁).

[0082] In graph G according to FIG. 2, the conditional probabilities on the right-hand side are represented by directed edges. In this case, the conditional probability P(X₂|X₁) is represented by a directed edge 12 from node 1 to node 2. The conditional probability P(X₃|X₂,X₁) is represented by a directed edge 14 from node 2 to node 3. Such graphs G are referred to as directed acyclic graphs (DAGs). The graphs G are called acyclic since, in the mathematical model being considered, there is never a cyclic graph configuration in which, for example in FIG. 2, a directed edge also extends from node 3 to node 1, which would close a circle.

[0083] In the conditional probability P(X₃|X₂,X₁), the random variables X₁ and X₂ represent the so-called parents (Pa) of the random variable X₃, that is to say

Pa(X₃)={X₁, X₂}

[0084] In general, therefore, a high-dimensional probability distribution of the variables X_(i) can be written as $P\left( X_{1}, \ldots, X_{k} \right) = \prod\limits_{i = 1}^{k} P\left( X_{i} \mid \mathrm{Pa}\left( X_{i} \right) \right).$

[0085] In this case, Pa(X_(i)) denotes the set of parents of the variable X_(i).
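The factorization over parents can be illustrated with a small sketch for three binary expression levels (0 = low, 1 = high). The parent sets follow the example Pa(X₃) = {X₁, X₂}; all numerical values in the conditional probability tables, and the names `parents`, `cpt` and `joint_probability`, are invented for illustration.

```python
from itertools import product

# parents[i] lists Pa(X_i); cpt[i][parent_values] is P(X_i = 1 | Pa(X_i) = parent_values).
# All probabilities below are hypothetical.
parents = {1: (), 2: (1,), 3: (1, 2)}
cpt = {
    1: {(): 0.3},
    2: {(0,): 0.2, (1,): 0.9},
    3: {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.95},
}

def joint_probability(x):
    """Return the product over all nodes i of P(x_i | Pa(x_i)); x maps node -> 0 or 1."""
    p = 1.0
    for node, pa in parents.items():
        p_one = cpt[node][tuple(x[q] for q in pa)]
        p *= p_one if x[node] == 1 else 1.0 - p_one
    return p

# Summing the factorized joint probability over all 2^3 patterns gives 1 (up to rounding).
total = sum(joint_probability(dict(zip((1, 2, 3), values)))
            for values in product((0, 1), repeat=3))
```

Only 1 + 2 + 4 = 7 table entries are needed here; the benefit of the factorization grows when most nodes have few parents.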

[0086] Statistical independencies can be determined in such a graph G by considering the parents of a random variable.

[0087] The structure of such a graph G is determined by comparison with obtained data, in the present case the measured expression patterns. The statistical problem can therefore be formulated in the following way: on the basis of a data record $D = \begin{pmatrix} x_{1}^{(1)} & x_{2}^{(1)} & \cdots & x_{k}^{(1)} \\ x_{1}^{(2)} & x_{2}^{(2)} & \cdots & x_{k}^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1}^{(n)} & x_{2}^{(n)} & \cdots & x_{k}^{(n)} \end{pmatrix}$

[0088] of n realizations of the random variables (X₁, . . . , X_(k)), the graph G which best reproduces the data record D is looked for.

[0089] There are essentially two ways of deducing the structure of a graph G from the data D: the so-called “constraint-based method” (R. Hofmann: “Lernen der Struktur nichtlinearer Abhängigkeiten mit graphischen Modellen” (learning the structure of nonlinear dependencies with graphical models), dissertation.de Berlin, 2000) and the so-called “score-based method” (R. Hofmann: “Lernen der Struktur nichtlinearer Abhängigkeiten mit graphischen Modellen”, dissertation.de Berlin, 2000), which is perhaps preferred for implementation of the method and system.

[0090] The “constraint-based method” attempts to deduce statistical dependencies or independencies from the data, in a similar way to that explained above in connection with the estimation of correlation coefficients.

[0091] The “score-based method” searches through the space of the possible graphs and evaluates the correspondence between the graphs and the data with the aid of an evaluation function. The model that has the best value of the evaluation function is selected. Possible evaluation functions are the Bayes' measure (D. Heckerman: “A Bayesian Approach to learning causal networks”, Tech Report MSR-TR-95-04, Microsoft Research 1995), the MDL metric (see below) or the BIC evaluation function (G. Schwarz: “Estimating the dimension of a model”, The Annals of Statistics 6(2): 461-464 (1978)).

[0092] The evaluation function used here is the MDL metric. MDL stands for “minimum description length”. This evaluation function has the purpose of describing the data by a network, or a graph G, as accurately as possible with the fewest possible edges. The evaluation function that is used is written: $L\left( G, D \right) = \log P(G) - n \cdot H\left( G, D \right) - \frac{1}{2} K \cdot \log n.$

[0093] In this case, log P(G) is the logarithm of the a priori probability (in the sense of a Bayes' evaluation) of the graph G being found. log P(G) is assumed to be equal for all graphs G. It can therefore be ignored during the maximization of L.

[0094] n is the number of available measured data records. $H\left( G, D \right) = \sum\limits_{i = 1}^{k} \sum\limits_{e = 1}^{E_{i}} \sum\limits_{l = 1}^{r_{i}} \sum\limits_{j = 1}^{q_{ei}} - \frac{N_{ilej}}{n} \log \frac{N_{ilej}}{N_{iej}}$

[0095] reflects the conditional entropy of the graph G with respect to the data D.

[0096] In this case, as mentioned above, k is the number of random variables X_(i), or the number of nodes i. This means that summation is carried out over all the nodes.

[0097] E_(i) is the number of direct parents of node i, that is to say the number of edges directed toward node i. This means that summation is additionally carried out over all the edges directed toward node i.

[0098] r_(i) is the number of possible (discrete or discretized) values x_(i) which the random variable X_(i) can take, and therefore which the node i can take. This means that summation is carried out over all possible values of the random variable X_(i), or of the node i.

[0099] q_(ei) is the number of possible (discrete or discretized) values x_(ei) which the direct parent node e of node i, that is to say the random variable X_(ei), can take. This means that summation is additionally carried out over all possible values of the random variable X_(ei), or of the node e.
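Since the counts used below require discrete values, continuous expression rates have to be discretized first. The text does not prescribe how; the following sketch assumes equal-frequency (quantile) binning per gene, with NumPy, the function name `discretize` and the default of three levels all chosen only for illustration.

```python
import numpy as np

def discretize(data, levels=3):
    """Map each gene's expression rates to integer levels 0 .. levels-1.

    Bin edges are per-gene quantiles, so each level receives roughly the same
    number of measurements (an assumption of this sketch, not of the text).
    data: array of shape (n, k), rows are measurements, columns are genes.
    """
    data = np.asarray(data, dtype=float)
    out = np.empty(data.shape, dtype=int)
    inner = np.linspace(0.0, 1.0, levels + 1)[1:-1]    # e.g. [1/3, 2/3] for 3 levels
    for i in range(data.shape[1]):
        edges = np.quantile(data[:, i], inner)          # interior bin edges for gene i
        out[:, i] = np.digitize(data[:, i], edges)      # index of the bin each value falls in
    return out

# Invented toy measurements of two genes.
levels = discretize([[1, 60], [2, 50], [3, 40], [4, 30], [5, 20], [6, 10]])
```

With this choice, r_(i) equals `levels` for every gene.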

[0100] N_(ilej) is the number of data records in which node i has the value x_(l) and the direct parent node e has the value x_(j), counted over all n data records. This means that the edge between nodes i and e is considered, and a count is made of how often the associated values x_(l) and x_(j) occurred in the measured data records. This is where the measured data enter the evaluation function.

[0101] Lastly, the normalization is $N_{iej} = \sum\limits_{l = 1}^{r_{i}} N_{ilej},$

[0102] that is to say summation is carried out over all values which the node i can assume.

[0103] The entropy is a non-negative measure of the uncertainty, which is a maximum when the uncertainty is a maximum, and which vanishes when there is complete knowledge.

[0104] K is given by: $K = \sum\limits_{i = 1}^{k} \sum\limits_{e = 1}^{E_{i}} q_{ei} \cdot \left( r_{i} - 1 \right).$

[0105] If the term “−1” in brackets is neglected, then K can be seen to reflect the number of all combinations of values, summed over all the edges. If the number of edges in a graph G is small, then as a rule K is also small, so that L is correspondingly larger. This last term on the right-hand side hence increases the value of L for graphs with few edges, so that it favors simple graphs. It is also referred to as evidence.

[0106] The evaluation function L corresponds approximately to the logarithm of the Bayes' probability for the graph G when the data D have been observed. It hence corresponds to a certain extent to the likelihood of the graph G. L is maximized, that is to say the graph G which maximizes the function L for the given data D is looked for.
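The quantities H, K and L can be transcribed edge by edge as in the following sketch, with log P(G) dropped as a constant. The function name `mdl_score`, the 0-based node numbering, the data layout (a list of tuples of discrete values) and the toy data are assumptions made for illustration. Value combinations that never occur are simply skipped, which corresponds to taking 0·log 0 as 0.

```python
import math
from collections import Counter

def mdl_score(data, edges, r):
    """Evaluate L(G, D) = -n * H(G, D) - 0.5 * K * log(n) for a given edge list.

    data:  list of tuples, one tuple of discrete values per measured data record
    edges: list of (e, i) pairs, meaning a directed edge from parent e to node i
    r:     r[i] is the number of values node i can take
    """
    n = len(data)
    H = 0.0
    K = 0
    for e, i in edges:
        pair_counts = Counter((row[i], row[e]) for row in data)    # N_ilej
        parent_counts = Counter(row[e] for row in data)            # N_iej
        for (l, j), n_ilej in pair_counts.items():
            H -= n_ilej / n * math.log(n_ilej / parent_counts[j])
        K += r[e] * (r[i] - 1)                                     # q_ei * (r_i - 1)
    return -n * H - 0.5 * K * math.log(n)

# Invented toy data: in `copied`, node 1 always equals node 0; in `unrelated`, it does not.
copied = [(0, 0), (1, 1)] * 4
unrelated = [(0, 0), (0, 1), (1, 0), (1, 1)] * 2
score_copied = mdl_score(copied, [(0, 1)], r=[2, 2])
score_unrelated = mdl_score(unrelated, [(0, 1)], r=[2, 2])
```

For the same edge, the data set in which the parent fully determines the child receives the higher score, because its conditional entropy term vanishes. Note that in this literal form only edges contribute, so a node without parents adds nothing to the score.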

[0107] A particularly efficient way of finding the edges of the graph G involves firstly assuming a set of independent random variables. Successively, the edge which most increases the function L is added to the network in each case. This is continued until a maximum of L is achieved.

[0108] As already mentioned, this can be carried out in a simple type of embodiment with the aid of the following steps:

[0109] firstly, the single edge which maximizes L is looked for, that is to say the best first edge,

[0110] subsequently, the best second edge is looked for, that is to say the second edge which, in addition to the already existing first edge, most substantially increases L,

[0111] etc., until L can no longer be increased further.
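The stepwise search can be sketched as follows: starting from a graph without edges, the single edge that most improves the score is added, and this is repeated until no edge improves it any further. As an explicit assumption, this sketch uses the common family-wise form of the MDL score, in which each node is scored against its whole parent set and a node without parents contributes its marginal entropy; this differs from a literal edge-by-edge evaluation. All names and the toy data are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations

def family_score(data, child, parents, r):
    """Score of one node given its parent set: log-likelihood minus 0.5 * K_i * log(n)."""
    n = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    config = Counter(tuple(row[p] for p in parents) for row in data)
    log_lik = sum(c * math.log(c / config[pa]) for (pa, _), c in joint.items())
    q = math.prod(r[p] for p in parents)         # number of parent configurations
    return log_lik - 0.5 * q * (r[child] - 1) * math.log(n)

def creates_cycle(parents, new_parent, child):
    """True if child is already an ancestor of new_parent (edge would close a cycle)."""
    stack, seen = [new_parent], set()
    while stack:
        node = stack.pop()
        if node == child:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def greedy_search(data, r, min_gain=1e-9):
    """Add the best edge repeatedly until the score no longer improves by more than min_gain."""
    k = len(r)
    parents = {i: [] for i in range(k)}           # start: all variables independent
    scores = {i: family_score(data, i, parents[i], r) for i in range(k)}
    while True:
        best = None
        for a, b in permutations(range(k), 2):    # candidate edge a -> b
            if a in parents[b] or creates_cycle(parents, a, b):
                continue
            gain = family_score(data, b, parents[b] + [a], r) - scores[b]
            if best is None or gain > best[0]:
                best = (gain, a, b)
        if best is None or best[0] <= min_gain:
            return parents
        gain, a, b = best
        parents[b].append(a)
        scores[b] += gain

# Invented toy data: nodes 0 and 1 always agree, node 2 is unrelated to both.
data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)] * 4
network = greedy_search(data, r=[2, 2, 2])
```

For this data the search links nodes 0 and 1 and leaves node 2 isolated. The direction of the single edge is not determined by the score, since both orientations fit the data equally well. Each step evaluates all candidate edges, which illustrates why the procedure becomes time-consuming for high-dimensional data.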

[0112] This algorithm, which is very time-consuming, especially for high-dimensional data, can be accelerated decisively by fast, quasi-optimal search strategies for important dependencies. One known algorithm for this is the greedy algorithm mentioned above.

[0113] In order to find not only local maxima of the graph structure, known algorithms such as simulated annealing or genetic algorithms may be used in combination with the algorithms described above, in order to look for the optimum graph.

[0114] Suitable targets can be identified from the regulatory network which has been deduced in such a way. For example, it can be seen in FIG. 1 that both gene A itself and also genes B, C, and D may be used as the target for influencing the concentration or efficacy of effector protein A.

[0115] The invention has been described in detail with particular reference to preferred embodiments thereof and examples, but it will be understood that variations and modifications can be effected within the spirit and scope of the invention.

1. A method of identifying pharmaceutical targets, comprising: determining a plurality of gene expression patterns of a cell and for each gene expression pattern, determining expression rates for genes of the cell; determining at least one dependency between the expression rates of different genes of the cell; and deducing a regulatory network of the cell from the at least one dependency.
2. The method as claimed in claim 1, further comprising assuming that not all the expression rates of the genes of the cell are mutually dependent.
3. The method as claimed in claim 1, wherein a set of independent gene expression rates is taken as an initial assumption; and modifying the initial assumption by successively assuming dependencies which most reduce errors in the gene expression rates.
4. The method as claimed in claim 1, wherein a plurality of dependencies are determined, and the dependencies are determined with the aid of a graph theory method.
5. The method as claimed in claim 1, further comprising: artificially modifying the expression rate of at least one gene of the cell to produce a modified gene expression rate; determining at least one modified gene expression pattern of the cell based on the modified gene expression rate; and comparing the modified gene expression pattern with at least one gene expression pattern without modification.
6. The method as claimed in claim 2, wherein a set of independent gene expression rates is taken as an initial assumption; and modifying the initial assumption by successively assuming dependencies which most reduce errors in the gene expression rates.
7. The method as claimed in claim 6, wherein a plurality of dependencies are determined, and the dependencies are determined with the aid of a graph theory method.
8. The method as claimed in claim 7, further comprising: artificially modifying the expression rate of at least one gene of the cell to produce a modified gene expression rate; determining at least one modified gene expression pattern of the cell based on the modified gene expression rate; and comparing the modified gene expression pattern with at least one gene expression pattern without modification.
9. A system to identify pharmaceutical targets, comprising: an expression unit to determine a plurality of gene expression patterns of a cell, the expression rate of the genes of the cell being determined in each case; a correlation unit to determine at least one correlation between the expression rates of different genes of the cell; and a network unit to deduce a regulatory network of the cell from the at least one correlation that has been determined.
10. A method of identifying pharmaceutical proteins, comprising: determining a plurality of gene patterns for a cell; determining the rate at which genes are expressed as proteins in the gene patterns; determining dependencies between the expression rates of different genes; developing a regulatory network for the cell, based on the dependencies, to describe interrelationships between the expression rates of different genes; identifying a target gene expressing a target protein; and using the regulatory network, identifying a protein which alters the expression rate of the target gene.